Parallel Out-of-Core Divide-and-Conquer Techniques with Application to Classification Trees
Abstract
Classification is an important problem in data mining. Constructing good classifiers is computationally intensive and offers plenty of scope for parallelization. The divide-and-conquer paradigm can be used to construct decision tree classifiers efficiently. We discuss in detail various techniques for parallel divide-and-conquer and extend them to handle disk-resident data efficiently. Furthermore, we suggest a generic technique for parallel out-of-core divide-and-conquer problems. We present pCLOUDS, the parallel version of the decision tree classifier algorithm CLOUDS, capable of handling large out-of-core data sets. pCLOUDS exhibits excellent speedup, sizeup, and scaleup properties, which make it a competitive tool for data mining applications. We evaluate the performance of pCLOUDS for a range of synthetic data sets on ...
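To make the divide-and-conquer idea in the abstract concrete, here is a minimal in-memory, sequential sketch of decision tree construction. It is not pCLOUDS or CLOUDS: it is neither parallel nor out-of-core. The function names (`gini`, `best_split`, `build_tree`, `predict`) are hypothetical, and the Gini impurity split criterion and exhaustive threshold search are assumptions made for illustration. Class labels are assumed not to be tuples, since tuples represent internal nodes here.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) pair that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        # Skip the largest value so the right partition is never empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels):
    """Divide: partition the data on the best split. Conquer: recurse on each part."""
    split = best_split(rows, labels)
    if split is None:
        # Pure node, or no split improves impurity: emit a majority-class leaf.
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left]),
            build_tree([r for r, _ in right], [y for _, y in right]))

def predict(tree, row):
    """Walk from the root to a leaf; internal nodes are (feature, threshold, left, right)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

The two recursive calls in `build_tree` operate on disjoint partitions and are independent of each other, which is the property that makes this kind of problem a candidate for parallel divide-and-conquer.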
Similar resources
On the Granularity of Divide-and-Conquer Parallelism
This paper studies the runtime behaviour of various parallel divide-and-conquer algorithms written in a non-strict functional language, when three common granularity control mechanisms are used: a simple cut-off, a priority thread creation and a priority scheduling mechanism. These mechanisms use granularity information that is currently provided via annotations to improve the performance of th...
Free Vibration Analysis of Repetitive Structures using Decomposition, and Divide-Conquer Methods
This paper consists of three sections. In the first section, an efficient method is used to decompose the canonical matrices associated with repetitive structures. To this end, a cylindrical coordinate system and a special numbering scheme are employed. In the second section, the divide-and-conquer method is used for the eigensolution of these structures, where the matrices are in ...
Parallel Rule Induction with Information Theoretic Pre-Pruning
In a world where data is captured on a large scale, the major challenge for data mining algorithms is to scale up to large datasets. There are two main approaches to inducing classification rules: the divide-and-conquer approach, also known as top-down induction of decision trees, and the separate-and-conquer approach. A considerable amount of work ...
Lecture 3 — Algorithmic Techniques and Analysis
Given an algorithmic problem, where do you even start? It turns out that most algorithms follow several well-known techniques. For example, when solving the shortest superstring (SS) problem, we already mentioned three techniques: brute force, reducing one problem to another, and the greedy approach. In this lecture, we will go over these techniques, which are key to both sequential and p...
Chemometrics-enhanced Classification of Source Rock Samples Using their Bulk Geochemical Data: Southern Persian Gulf Basin
Chemometric methods can enhance geochemical interpretations, especially when working with large datasets. With this aim, exploratory hierarchical cluster analysis (HCA) and principal component analysis (PCA) methods are used herein to study the bulk pyrolysis parameters of 534 samples from the Persian Gulf basin. These methods are powerful techniques for identifying the patterns of variations i...